This vignette demonstrates how to run CIPR on Seurat objects. If you use CIPR, please cite:

CIPR: a web-based R/shiny app and R package to annotate cell clusters in single cell RNA sequencing experiments

H. Atakan Ekiz, Christopher J. Conley, W. Zac Stephens & Ryan M. O’Connell

BMC Bioinformatics, 2020.

doi: 10.1186/s12859-020-3538-2

Github: https://github.com/atakanekiz/CIPR-Package

Summary

This vignette describes how to use CIPR package with 3k PBMC data freely available from 10X genomics. Here, we recycle the code described in Seurat’s guided clustering tutorial to help users perform analyses from scratch. Using this dataset we will demonstrate the capabilities of CIPR to annotate single cell clusters in single cell RNAseq (scRNAseq) experiments. For further information about other clustering methods, please see Seurat’s comprehensive website

Install CIPR

if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")

# Use this option if you want to build vignettes during installation This can take a long time
# due to the installation of suggested packages.
remotes::install_github("atakanekiz/CIPR-Package", build_vignettes = TRUE)

# Use this if you would like to install the package without vignettes
# remotes::install_github('atakanekiz/CIPR-Package')

Seurat pipeline

Setup Seurat object

library(dplyr)
library(Seurat)
library(SeuratData)
library(CIPR)
# Load data
InstallData("pbmc3k")
pbmc <- pbmc3k

Pre-processing

The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.

# Calculate mitochondrial gene representation (indicative of low quality cells)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# Filter out genes with feature counts outside of 200-2500 range, and >5% mt genes
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

Normalizing data

pbmc <- NormalizeData(pbmc)

Variable gene detection and scaling

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)

Perform PCA

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
ElbowPlot(pbmc)

Cluster cells

pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)

Run non-linear dimensionality reduction (tSNE)

pbmc <- RunTSNE(pbmc, dims = 1:10)
pbmc$unnamed_clusters <- Idents(pbmc)
# saveRDS(pbmc, 'pbmc.rds')

Find differentially expressed genes

This is the step where we generate the input for CIPR’s log fold change (logFC) comparison methods.

allmarkers <- FindAllMarkers(pbmc)

Calculate average gene expression per cluster

This is the step where we generate the input for CIPR’s all-genes correlation methods.

avgexp <- AverageExpression(pbmc)
avgexp <- as.data.frame(x = avgexp$RNA)
avgexp$gene <- rownames(avgexp)

Visualize Seurat pbject

DimPlot(pbmc)

CIPR analysis

The user can select one of the 7 provided reference data sets:

Reference reference argument
Immunological Genome Project (ImmGen) “immgen”
Presorted cell RNAseq (various tissues) “mmrnaseq”
Blueprint/ENCODE “blueprint”
Human Primary Cell Atlas “hpca”
Database of Immune Cell Expression (DICE) “dice”
Hematopoietic differentiation “hema”
Presorted cell RNAseq (PBMC) “hsrnaseq”
User-provided custom reference “custom”

Standard logFC comparison method

In this method CIPR accepts allmarkers data frame created above and performs the following analytical steps:

  • It calculates a vector of logFC values for each reference sample (i.e. individual columns of the reference data frame) by comparing log-normalized expression value of a gene (i.e. rows of the reference data frame) to the average gene expression across the entire reference dataset.
  • It then scores unknown cluster logFC differential gene expression data against this reference logFC values to create a vector of identity scores
  • User can select one of three methods:
    • LogFC dot product (sum of all logFC x logFC values among matching genes). This is the recommended method in CIPR.
    • LogFC Spearman’s correlation (rank correlation of logFC values)
    • LogFC Pearson’s correlation (linear correlation of logFC values)

Plot all identity scores per cluster-reference cell pairs

The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots

CIPR results can be summarized for each cluster in scatter plots.

CIPR(input_dat = allmarkers,
     comp_method = "logfc_dot_product", 
     reference = "hsrnaseq", 
     plot_ind = T,
     plot_top = F, 
     global_results_obj = T, 
     global_plot_obj = T,
     # axis.text.x=element_text(color="red") # arguments to pass to ggplot2::theme() to change plotting parameters
     )

Plot identity scores for a select cluster

ind_clu_plots object is created in the global environment to help users can visualize results for a desired cluster and manipulate graphing parameters. ggplot2 functions can be iteratively added to individual plots to create annotations etc.

library(ggplot2)
ind_clu_plots$cluster6 + theme(axis.text.y = element_text(color = "red"), axis.text.x = element_text(color = "blue")) + 
    labs(fill = "Reference") + ggtitle("Figure S4a. Automated cluster annotation results are shown for cluster 6") + 
    annotate("text", label = "2 sd range", x = 10, y = 700, size = 8, color = "steelblue") + annotate("text", 
    label = "1 sd range", x = 10, y = 200, size = 8, color = "orange2") + geom_rect(aes(xmin = 94, 
    xmax = 99, ymin = 1000, ymax = 1300), fill = NA, size = 3, color = "red")

Plot top scoring refernce subsets for each cluster

CIPR(input_dat = allmarkers, comp_method = "logfc_dot_product", reference = "hsrnaseq", plot_ind = F, 
    plot_top = T, global_results_obj = T, global_plot_obj = T)

Tabulate CIPR results

CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (CIPR_top_results and CIPR_all_results respectively) to allow users to explore the outputs and generate specific plots and tables.

head(CIPR_top_results)
cluster reference_cell_type reference_id long_name description identity_score index z_score percent_pos_correlation
0 CD8+ T cell G4YW_CD8_naive Naive CD8 T cells N/A 837.5106 1 1.657372 87.30684
0 CD8+ T cell DZQV_CD8_naive Naive CD8 T cells N/A 832.9015 2 1.648251 83.44371
0 CD8+ T cell 925L_CD8_naive Naive CD8 T cells N/A 778.8108 3 1.541209 84.87859
0 CD8+ T cell 9JD4_CD8_naive Naive CD8 T cells N/A 750.8086 4 1.485795 82.45033
0 CD4+ T cell 9JD4_CD4_naive Naive CD4 T cells N/A 742.8635 5 1.470072 86.31347
1 Monocyte G4YW_C_mono Classical monocytes N/A 2030.5922 6 2.199703 90.61538
head(CIPR_all_results)
reference_id identity_score reference_cell_type long_name description cluster z_score percent_pos_correlation
DZQV_B_naive -506.4224 B cell Naive B cells N/A 0 -1.0021725 42.16336
DZQV_B_NSM -414.3927 B cell Non-switched memory B cells N/A 0 -0.8200527 43.37748
DZQV_B_Ex -438.5500 B cell Exhausted B cells N/A 0 -0.8678580 43.92936
DZQV_B_SM -441.4376 B cell Switched memory B cells N/A 0 -0.8735724 41.39073
DZQV_Plasmablasts 226.2113 B cell Plasmablasts N/A 0 0.4476555 64.56954
925L_B_naive -128.9296 B cell Naive B cells N/A 0 -0.2551421 61.92053

Standard all-genes correlation method

CIPR also implements a simple correlation approach in which overall correlation in gene expression is calculated for the pairs of unknown clusters and the reference samples (regardless of the differential expression status of the gene). This approach is conceptually similar to some other automated identity prediction pipelines such as SingleR and scMCA.

  • Spearman’s correlation: It calculates correlation based on ranked gene expression. It can be suitable for comparing experimental and reference data which were obtained using different technologies.
  • Pearson’s correlation: It calculates linear correlations. This can be useful when the user would like to provide a custom reference dataset to CIPR.

Plot all identity scores per cluster-reference cell pairs

The code below performs analysis using sorted human PBMC RNAseq data as reference, and plots

CIPR results can be summarized for each cluster in scatter plots.

CIPR(input_dat = avgexp, comp_method = "all_genes_spearman", reference = "hsrnaseq", plot_ind = T, 
    plot_top = F, global_results_obj = T, global_plot_obj = T)

Plot top scoring refernce subsets for each cluster

CIPR(input_dat = avgexp, comp_method = "all_genes_spearman", reference = "hsrnaseq", plot_ind = F, 
    plot_top = T, global_results_obj = T, global_plot_obj = T)

Tabulate CIPR results

CIPR results (both top 5 scoring reference types per cluster and the entire analysis) are saved as global objects (CIPR_top_results and CIPR_all_results respectively) to allow users to explore the outputs and generate specific plots and tables.

head(CIPR_top_results)
cluster reference_cell_type reference_id long_name description identity_score index z_score
0 CD4+ T cell DZQV_CD4_naive Naive CD4 T cells N/A 0.7967446 1 1.390953
0 CD4+ T cell 925L_TFH Follicular helper T cells N/A 0.7929448 2 1.338339
0 CD4+ T cell G4YW_Th1 Th1 cells N/A 0.7880379 3 1.270397
0 CD4+ T cell G4YW_Treg T regulatory cells N/A 0.7864624 4 1.248583
0 CD4+ T cell DZQV_Th17 Th17 cells N/A 0.7803023 5 1.163288
1 Monocyte G4YW_I_mono Intermediate monocytes N/A 0.7844655 6 2.715642
head(CIPR_all_results)
reference_id identity_score reference_cell_type long_name description cluster z_score
DZQV_B_naive 0.6503197 B cell Naive B cells N/A 0 -0.6364848
DZQV_B_NSM 0.6480390 B cell Non-switched memory B cells N/A 0 -0.6680630
DZQV_B_Ex 0.6488979 B cell Exhausted B cells N/A 0 -0.6561705
DZQV_B_SM 0.6961983 B cell Switched memory B cells N/A 0 -0.0012370
DZQV_Plasmablasts 0.6816080 B cell Plasmablasts N/A 0 -0.2032581
925L_B_naive 0.6421836 B cell Naive B cells N/A 0 -0.7491386

Limiting analysis to the select subsets of reference data

Sometimes excluding irrelevant reference cell types from the analysis can be helpful. Especially when the logFC comparison methods are utilized, removing irrelevant subsets may improve discrimination of closely related subsets, since the reference logFC values will be calculated after subsetting the data frame. Filtering out reference subsets should not impact results of the all-genes correlation methods, but it can make the graphical outputs easier to look at

3k PBMC dataset may not be the best example to demonstrate benefits of reference dataset subsetting, but the code below serves as an example for this functionality.

CIPR(input_dat = allmarkers, comp_method = "logfc_dot_product", reference = "hsrnaseq", plot_ind = T, 
    plot_top = F, global_results_obj = T, global_plot_obj = T, select_ref_subsets = c("CD4+ T cell", 
        "CD8+ T cell", "Monocyte", "NK cell"))

Filtering out lowly variable genes

Genes that have a low expression variance across the reference data frame has weaker discriminatory potential. Thus, excluding these genes from the analysis can reduce the noise and improve the prediction scores, especially when using all-genes correlation based methods.

We implemented a variance filtering parameter, keep_top_var, which allows users to keep top Nth% variable reference genes in the analysis. For instance, by setting this argument to 10, CIPR can be instructed to use only the top 10% highly variable genes in identity score calculations. In our experience (Ekiz HA, BMC Bioinformatics, in revision) limiting the analysis to highly variable genes does not significantly impact the identity scores of the top-scoring reference cell subsets, but it reduces the identity scores of intermediate/low-scoring reference cells leading to an improvement of z-scores. The “best” value for this parameter remains to be determined by the user in individual studies.

CIPR(input_dat = avgexp, comp_method = "all_genes_spearman", reference = "hsrnaseq", plot_ind = T, 
    plot_top = F, global_results_obj = T, global_plot_obj = T, keep_top_var = 10)

Session Info

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.3           pbmc3k.SeuratData_3.1.4 CIPR_0.1.0             
## [4] SeuratData_0.2.1        SeuratObject_4.0.1      Seurat_4.0.1           
## [7] dplyr_1.0.6            
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1          backports_1.2.1       plyr_1.8.6           
##   [4] igraph_1.2.6          lazyeval_0.2.2        splines_4.0.4        
##   [7] listenv_0.8.0         scattermore_0.7       digest_0.6.27        
##  [10] htmltools_0.5.1.1     fansi_0.4.2           magrittr_2.0.1       
##  [13] tensor_1.5            cluster_2.1.0         ROCR_1.0-11          
##  [16] openxlsx_4.2.3        limma_3.46.0          remotes_2.3.0        
##  [19] globals_0.14.0        matrixStats_0.58.0    spatstat.sparse_2.0-0
##  [22] prettyunits_1.1.1     colorspace_2.0-1      rappdirs_0.3.3       
##  [25] ggrepel_0.9.1         haven_2.4.1           xfun_0.23            
##  [28] callr_3.7.0           crayon_1.4.1          jsonlite_1.7.2       
##  [31] spatstat.data_2.1-0   survival_3.2-7        zoo_1.8-9            
##  [34] glue_1.4.2            polyclip_1.10-0       gtable_0.3.0         
##  [37] leiden_0.3.7          car_3.0-10            pkgbuild_1.2.0       
##  [40] future.apply_1.7.0    abind_1.4-5           scales_1.1.1         
##  [43] rstatix_0.7.0         miniUI_0.1.1.1        Rcpp_1.0.6           
##  [46] viridisLite_0.4.0     xtable_1.8-4          reticulate_1.20      
##  [49] spatstat.core_2.1-2   foreign_0.8-81        htmlwidgets_1.5.3    
##  [52] httr_1.4.2            RColorBrewer_1.1-2    ellipsis_0.3.2       
##  [55] ica_1.0-2             pkgconfig_2.0.3       farver_2.1.0         
##  [58] sass_0.4.0            uwot_0.1.10           deldir_0.2-10        
##  [61] utf8_1.2.1            tidyselect_1.1.1      labeling_0.4.2       
##  [64] rlang_0.4.11          reshape2_1.4.4        later_1.2.0          
##  [67] cellranger_1.1.0      munsell_0.5.0         tools_4.0.4          
##  [70] cli_2.5.0             generics_0.1.0        broom_0.7.6          
##  [73] ggridges_0.5.3        evaluate_0.14         stringr_1.4.0        
##  [76] fastmap_1.1.0         yaml_2.2.1            goftest_1.2-2        
##  [79] processx_3.5.2        knitr_1.33            fitdistrplus_1.1-3   
##  [82] zip_2.1.1             purrr_0.3.4           RANN_2.6.1           
##  [85] pbapply_1.4-3         future_1.21.0         nlme_3.1-152         
##  [88] mime_0.10             formatR_1.9           compiler_4.0.4       
##  [91] rstudioapi_0.13       plotly_4.9.3          curl_4.3.1           
##  [94] png_0.1-7             ggsignif_0.6.1        spatstat.utils_2.1-0 
##  [97] tibble_3.1.2          bslib_0.2.5.1         stringi_1.6.2        
## [100] highr_0.9             ps_1.6.0              forcats_0.5.1        
## [103] lattice_0.20-41       Matrix_1.3-3          vctrs_0.3.8          
## [106] pillar_1.6.1          lifecycle_1.0.0       jquerylib_0.1.4      
## [109] spatstat.geom_2.1-0   lmtest_0.9-38         RcppAnnoy_0.0.18     
## [112] data.table_1.14.0     cowplot_1.1.1         irlba_2.3.3          
## [115] httpuv_1.6.1          patchwork_1.1.1       R6_2.5.0             
## [118] promises_1.2.0.1      KernSmooth_2.23-18    gridExtra_2.3        
## [121] rio_0.5.26            parallelly_1.25.0     codetools_0.2-18     
## [124] MASS_7.3-53           gtools_3.8.2          rprojroot_2.0.2      
## [127] withr_2.4.2           sctransform_0.3.2     hms_1.1.0            
## [130] mgcv_1.8-33           parallel_4.0.4        grid_4.0.4           
## [133] rpart_4.1-15          tidyr_1.1.3           rmarkdown_2.8        
## [136] carData_3.0-4         Rtsne_0.15            ggpubr_0.4.0         
## [139] shiny_1.6.0